ANLY 500 will focus on the foundations of:
The first question we must ask ourselves in this course is: What is Analytics?
We should note analytics can be defined in two ways!
The utilization of:
The focus of data analytics can be defined under three scopes, including:
Dataset: “Sunspot Trends from 1749-01-01 to 2013-09-01”
Description: Understand the Historical Trend of Sunspots from 1749 to 2013.
library(ggplot2)
sunspot.month <- as.data.frame(sunspot.month)
sunspot.month$Time <- 1:nrow(sunspot.month)
ggplot(sunspot.month, aes(x = Time, y = x)) +
geom_point(alpha = 0.5) +
ylab("Number of Sunspots") +
xlab("Time") +
theme_classic()

library(quantmod)
start <- as.Date(Sys.Date()-(365*5))
end <- as.Date(Sys.Date()-2)
getSymbols("AMZN", src = "yahoo", from = start, to = end)
#> [1] "AMZN"
str(AMZN)
#> An 'xts' object on 2016-02-16/2021-02-11 containing:
#> Data: num [1:1258, 1:6] 519 529 541 521 542 ...
#> - attr(*, "dimnames")=List of 2
#> ..$ : NULL
#> ..$ : chr [1:6] "AMZN.Open" "AMZN.High" "AMZN.Low" "AMZN.Close" ...
#> Indexed by objects of class: [Date] TZ: UTC
#> xts Attributes:
#> List of 2
#> $ src : chr "yahoo"
#> $ updated: POSIXct[1:1], format: ...

predictive_model <- lm(formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume,
                       data = AMZN[1:1199,])
summary(predictive_model)
#>
#> Call:
#> lm(formula = AMZN.Close ~ AMZN.High + AMZN.Low + AMZN.Volume,
#> data = AMZN[1:1199, ])
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -95.013 -5.817 -0.340 5.629 103.543
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) -0.2879421516 1.7044520384 -0.169
#> AMZN.High 0.4554673789 0.0251012225 18.145
#> AMZN.Low 0.5459353425 0.0258159562 21.147
#> AMZN.Volume 0.0000001222 0.0000002929 0.417
#> Pr(>|t|)
#> (Intercept) 0.866
#> AMZN.High <0.0000000000000002 ***
#> AMZN.Low <0.0000000000000002 ***
#> AMZN.Volume 0.677
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 15.65 on 1195 degrees of freedom
#> Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
#> F-statistic: 8.02e+05 on 3 and 1195 DF, p-value: < 0.00000000000000022

par(mfrow=c(2,3))
plot(predictive_model,1)
plot(predictive_model,2)
plot(predictive_model,3)
plot(predictive_model,4)
plot(predictive_model,5)

Analytics is the discovery, interpretation, and communication of meaningful patterns in, or summaries of, data by means of data analytics.
Now we should be asking the question: What is Data Analytics?
High level analysis techniques commonly used in data analytics include:
However, two other types of analysis may be considered.
Quantitative data analysis: involves analysis of numerical data with quantifiable variables that can be compared or measured statistically.
Qualitative data analysis: a more interpretive approach that focuses on understanding the content of non-numerical data such as text, images, audio, and video, including common phrases, themes, and points of view.
In other words, formulate a question that needs to be answered.
Test the concept:
Theory:
Hypothesis:
Falsification:
Independent Variable:
Dependent Variable:
Data is a set of values/measurements of quantitative or qualitative variables.
In a dataset, we can distinguish two types of variables:
Definition - entities that are divided into distinct categories.
Includes the following:
R stores categorical variables as factors or character vectors.
Factors are variables in R that take on a limited number of distinct values.
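As a minimal sketch of how a factor behaves, the example below converts a character vector into a factor and inspects its categories. The `responses` data are hypothetical and invented purely for illustration.

```r
# Hypothetical survey responses stored as plain character strings
responses <- c("yes", "no", "no", "yes", "yes")

# Convert to a factor: R records the distinct categories as levels
responses_factor <- factor(responses)

levels(responses_factor)  # the distinct categories, sorted alphabetically
table(responses_factor)   # number of observations in each category
```

The factor stores each observation as one of a fixed set of levels, which is why it suits variables with a limited number of distinct values.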
Definition - A binary variable has only two categories.
Definition - A nominal variable has more than two categories.
Definition - An ordinal variable is the same as a nominal variable, but the categories have a logical order.
In addition to being able to classify values into categories, you can order the categories: first, second, third.
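The difference between nominal and ordinal variables can be sketched in R with unordered and ordered factors. The `eye_colour` and `placing` values below are hypothetical examples, not data from the course.

```r
# Nominal: categories with no inherent order (hypothetical eye colours)
eye_colour <- factor(c("brown", "blue", "green", "brown"))

# Ordinal: categories with a logical order (hypothetical race placings)
placing <- factor(c("second", "first", "third", "first"),
                  levels = c("first", "second", "third"),
                  ordered = TRUE)

is.ordered(eye_colour)  # FALSE
is.ordered(placing)     # TRUE

# Order comparisons are only meaningful for ordered factors
placing < "third"       # TRUE TRUE FALSE TRUE
```

Declaring the level order explicitly matters, because R would otherwise sort the levels alphabetically.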
Definition - entities get a distinct score.
Includes the following:
Definition - An interval variable has equal intervals along its scale, so equal intervals represent equal differences in the property being measured. This variable does not have a true zero.
Definition - A ratio variable is the same as an interval variable, but the ratios of scores on the scale must also make sense. This variable does have a true zero.
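One way to see why a true zero matters is to compare the same temperatures on the Celsius scale (interval) and the kelvin scale (ratio). This is an illustrative sketch; the temperature values are chosen arbitrarily.

```r
# Temperatures in degrees Celsius (interval scale: zero is arbitrary)
celsius <- c(10, 20)

# The same temperatures in kelvin (ratio scale: zero is a true zero)
kelvin <- celsius + 273.15

# Differences are meaningful on both scales (10 in each case,
# up to floating-point rounding)
celsius[2] - celsius[1]
kelvin[2] - kelvin[1]

# Ratios only make sense on the ratio scale
celsius[2] / celsius[1]  # 2, yet 20 degrees C is not "twice as hot" as 10
kelvin[2] / kelvin[1]    # roughly 1.035
```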
The accuracy of the measurements is key to your solutions.
Measurement Error (also known as observational error):
Definition - The discrepancy between the actual value we’re trying to measure, and the number we use to represent that value.
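As a small illustration of this definition, the sketch below treats a person's actual weight as known and compares it with several scale readings. Both the `true_weight` and the `readings` are hypothetical values made up for the example.

```r
true_weight <- 80            # assumed actual value, in kg
readings <- c(83, 79.5, 81)  # hypothetical readings from a bathroom scale

# Measurement error: the recorded number minus the actual value
measurement_error <- readings - true_weight
measurement_error             # 3.0 -0.5 1.0

# Average size of the error, ignoring its direction
mean(abs(measurement_error))  # 1.5
```

In practice the actual value is usually unknown, so measurement error has to be assessed indirectly.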
Validity:
Including the following:
Reliability:
Test-Retest Reliability:
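Test-retest reliability is commonly assessed by measuring the same people on two occasions and correlating the two sets of scores. The sketch below simulates that situation; the sample size, score distribution, and noise level are all assumptions chosen for illustration.

```r
set.seed(500)

# Simulated "true" scores for 50 hypothetical participants
true_score <- rnorm(50, mean = 100, sd = 15)

# Each administration of the test adds independent measurement noise
time1 <- true_score + rnorm(50, sd = 5)
time2 <- true_score + rnorm(50, sd = 5)

# A high correlation suggests the instrument gives consistent results
test_retest_r <- cor(time1, time2)
test_retest_r
```

Increasing the noise `sd` in the simulation lowers the correlation, which mirrors a less reliable instrument.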
To use measures in any research and test them we must now understand the following: How to Measure?
It is different for certain types of research, including:
Definition - One or more variables are systematically manipulated to see their effect (alone or in combination) on an outcome variable.
Cause and Effect (Hume, 1748)
Confounding variables: the ‘Tertium Quid’
Ruling out confounds (Mill, 1865)
Considering the what & how to measure, we must now look at the methods of data collection.
For instance:
Between-group/between-subject/independent
Repeated-measures (within-subject)
Systematic Variation
Unsystematic Variation
Randomization
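Random assignment to conditions can be sketched in a few lines of R. The participant IDs and the two condition names below are hypothetical; `sample()` shuffles the condition labels so that chance alone decides who lands in which group.

```r
set.seed(500)

# 20 hypothetical participants
participants <- paste0("P", 1:20)

# Two conditions with 10 slots each
conditions <- rep(c("control", "treatment"), each = 10)

# Shuffle the condition labels to assign participants at random
assignment <- data.frame(participant = participants,
                         condition = sample(conditions))

table(assignment$condition)  # 10 participants per condition
head(assignment)
```

Shuffling a balanced vector of labels keeps the group sizes equal, whereas sampling labels with replacement would not guarantee that.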
First, populations and samples should be understood so that your analysis is not misleading when interpreting results.
Population
Sample
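The relationship between a population and a sample can be illustrated by simulation. The population below is entirely synthetic (normally distributed with an assumed mean of 50 and standard deviation of 10), and the sample size of 30 is an arbitrary choice.

```r
set.seed(500)

# A synthetic population of 100,000 scores
population <- rnorm(100000, mean = 50, sd = 10)

# A single study observes only a small sample from it
study_sample <- sample(population, size = 30)

mean(population)    # the population value we would like to know
mean(study_sample)  # the sample estimate, which differs somewhat
```

Re-running the `sample()` line without resetting the seed gives a different estimate each time, which is the sampling variation that inferential statistics has to account for.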
A simple statistical model can be used to analyze data.
For instance, the mean is a hypothetical value.
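To see the mean acting as a simple model, the sketch below fits it to a small set of counts and measures how far each observation falls from it. The `friends` values are hypothetical; note that the resulting mean is a value no individual actually has.

```r
# Hypothetical number of friends reported by five people
friends <- c(1, 2, 3, 3, 4)

# The mean is the model's single prediction for everyone
model_mean <- mean(friends)   # 2.6, a value nobody actually reported

# Error: how far each observation is from the model
errors <- friends - model_mean

sum(errors)    # effectively 0: positive and negative errors cancel
sum(errors^2)  # 5.2: sum of squared errors, a measure of total misfit
```

Squaring the errors before summing prevents them from cancelling, so the sum of squares can be used to judge how well the model fits.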
The numbers estimated from a single test/study/experiment are considered a sample.
Parameters = Greek Symbols
Statistics = Latin Letters
To analyze the data and generate interpretable results the following statistical models can be used:
In this lecture, you have learned: